

Reviews: Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems

Neural Information Processing Systems

The paper attempts to move away from traditional evaluation of open-domain dialog systems (i.e., judging a response given its conversation history) and towards a more interactive one (i.e., a human talking to a bot), which is likely an important step towards better evaluation. However, I do have several serious concerns about this work in its current form: (1) The authors contrast their work with existing evaluation for open-domain dialog, which they call "single-turn" evaluation. They point out that this type of evaluation fails to capture "failure modes […] such as a lack of diversity in the responses, inability to track long-term aspects of the conversation". I think this is rather misleading and the term "single-turn" is a misnomer. Most previous work has indeed evaluated each conversation by factorizing it into a sequence of independent turn-level judgments, but each of these judgments assesses the quality of the current turn T_n **given** a history of several previous turns …, T_{n-k}, …, T_{n-1}.


Reviews: Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems

Neural Information Processing Systems

This paper explores interesting directions, in particular 1) using interactive settings to evaluate a model rather than a single answer, and 2) combining different automated metrics (e.g., based on sentiment) into a weighted sum to approximate human evaluation. Reviewers have raised crucial points regarding gameability (so using the metrics for training a model is tricky if not followed by a non-gameable evaluation) and lack of comparability between different self-play runs. It is indeed a much better evaluation setting if the system does not control both sides (e.g., models being matched against the same set of fixed models), so the authors should definitely pursue that direction. However, I expect this work would still be interesting to the dialog community: many of the diagnostic advantages of the model-talking-to-model setting remain in practice, especially because the model is not trained with the self-play objective; the criterion is only used post hoc, so the system cannot extensively exploit it during training. In practice, many of the problems in a given model's generations already show up during self-play, and the reasonable worry raised by reviewers that the model could exploit the metric remains theoretical at the moment.
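The weighted combination the meta-review describes can be sketched in a few lines. The proxy names, values, and weights below are purely illustrative assumptions, not the paper's actual metrics or fitted coefficients:

```python
# Hedged sketch: combine per-conversation automated proxy metrics
# (each assumed to be scaled to [0, 1]) into a single weighted score.
# The specific proxies and weights are made up for illustration.

def combined_score(proxies, weights):
    """Weighted sum of proxy metric scores for one conversation."""
    return sum(weights[name] * value for name, value in proxies.items())

proxies = {"sentiment": 0.8, "coherence": 0.6, "diversity": 0.5}
weights = {"sentiment": 0.5, "coherence": 0.3, "diversity": 0.2}
print(round(combined_score(proxies, weights), 2))  # 0.68
```

In practice such weights would be fit against human judgments rather than chosen by hand, which is exactly what makes the gameability concern above worth taking seriously.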


Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems

Ghandeharioun, Asma, Shen, Judy Hanwen, Jaques, Natasha, Ferguson, Craig, Jones, Noah, Lapedriza, Agata, Picard, Rosalind

Neural Information Processing Systems

Building an open-domain conversational agent is a challenging problem. Current evaluation methods, mostly post-hoc judgments of static conversations, do not capture conversation quality in a realistic interactive context. In this paper, we investigate interactive human evaluation and provide evidence for its necessity; we then introduce a novel, model-agnostic, and dataset-agnostic method to approximate it. In particular, we propose a self-play scenario where the dialog system talks to itself, and we calculate a combination of proxies such as sentiment and semantic coherence on the conversation trajectory. We show that this metric is capable of capturing the human-rated quality of a dialog model better than any automated metric known to date, achieving a significant Pearson correlation (r > .7).
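The abstract's headline number is a Pearson correlation between the self-play metric and human ratings. A minimal sketch of that check, using made-up illustrative scores rather than the paper's data:

```python
# Hedged sketch: Pearson correlation between an automated metric computed
# per model and the mean human quality rating of that model. The score
# lists below are invented for illustration only.
from statistics import mean

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

metric_scores = [0.2, 0.4, 0.5, 0.7, 0.9]  # hypothetical self-play metric per model
human_ratings = [1.5, 2.0, 2.4, 3.1, 3.8]  # hypothetical mean human ratings
print(round(pearson_r(metric_scores, human_ratings), 3))
```

With real data one would also report the p-value (e.g., via `scipy.stats.pearsonr`), since a high r over few models can still be noisy.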


Challenges in Building Intelligent Open-domain Dialog Systems

Huang, Minlie, Zhu, Xiaoyan, Gao, Jianfeng

arXiv.org Artificial Intelligence

There is a resurgent interest in developing intelligent open-domain dialog systems due to the availability of large amounts of conversational data and the recent progress on neural approaches to conversational AI. Unlike traditional task-oriented bots, an open-domain dialog system aims to establish long-term connections with users by satisfying the human need for communication, affection, and social belonging. This paper reviews the recent works on neural approaches that are devoted to addressing three challenges in developing such systems: semantics, consistency, and interactiveness. Semantics requires a dialog system to not only understand the content of the dialog but also identify the user's social needs during the conversation. Consistency requires the system to demonstrate a consistent personality to win users' trust and gain their long-term confidence. Interactiveness refers to the system's ability to generate interpersonal responses to achieve particular social goals such as entertainment, conforming, and task completion. The works we select to present here are based on our unique views and are by no means complete. Nevertheless, we hope that the discussion will inspire new research in developing more intelligent dialog systems.


RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems

Tao, Chongyang (Peking University) | Mou, Lili (University of Waterloo) | Zhao, Dongyan (Peking University) | Yan, Rui (Peking University)

AAAI Conferences

Open-domain human-computer conversation has been attracting increasing attention over the past few years. However, there does not exist a standard automatic evaluation metric for open-domain dialog systems; researchers usually resort to human annotation for model evaluation, which is time- and labor-intensive. In this paper, we propose RUBER, a Referenced metric and Unreferenced metric Blended Evaluation Routine, which evaluates a reply by taking into consideration both a ground-truth reply and a query (the previous user-issued utterance). Our metric is learnable, but its training does not require labels of human satisfaction. Hence, RUBER is flexible and extensible to different datasets and languages. Experiments on both retrieval and generative dialog systems show that RUBER has a high correlation with human annotation, and that RUBER has fair transferability over different datasets.